Functional unknomics: Systematic screening of conserved genes of unknown function

The human genome encodes approximately 20,000 proteins, many still uncharacterised. It has become clear that scientific research tends to focus on well-studied proteins, leading to a concern that poorly understood genes are unjustifiably neglected. To address this, we have developed a publicly available and customisable “Unknome database” that ranks proteins based on how little is known about them. We applied RNA interference (RNAi) in Drosophila to 260 unknown genes that are conserved between flies and humans. Knockdown of some genes resulted in loss of viability, and functional screening of the rest revealed hits for fertility, development, locomotion, protein quality control, and resilience to stress. CRISPR/Cas9 gene disruption validated a component of Notch signalling and 2 genes contributing to male fertility. Our work illustrates the importance of poorly understood genes, provides a resource to accelerate future research, and highlights a need to support database curation to ensure that misannotation does not erode our awareness of our own ignorance.


Supplemental Text -Statistical methods
Here we provide further details to complement the outline of the statistical methods presented in the Methods section. The approach used consists of three steps: modelling the data to associate each gene with statistical parameters; construction of an outlier region in the space of these parameters; and performing hypothesis tests to determine whether these parameters fall within this region. The material here is similarly organised into three sections describing each of these steps. In much of what follows, we consider the experimental setting (i.e. fertility, wing size etc.) to be fixed and describe general procedures that we apply, with some modifications, to each of the settings.

1
Modelling the data

General procedure
The data for each experiment takes the general form ( !" , !" ) ∈ ℝ # × ℝ $ , = 1, … , " , = 1, … , , where was the total number of genes, and ∈ {1,2}. Here !" corresponds to the th measurement taken on the th gene and the !" are associated covariates that may indicate the batch in which the measurement was taken, for example. Our goal is to identify outlying genes, and for this purpose we first construct a parametric model for the data of the form where and collect together the response of covariates respectively, = ( % , … , & ) ∈ ℝ #×& are the parameters associated the with genes and represents a collection of nuisance parameters (e.g. parameters associated with the different batches). The statistical problem at hand then is to identify outlying " . For this we need to introduce a notion of what it means to be an outlier, and then propose a methodology for testing for each whether " is an outlier. These latter two tasks are described in Sections 2 and 3. To do these, we require estimates ( 9 " ) "(% & of ( " ) "(% & that are approximately unbiased and Gaussian with estimated variance Σ 9 ∈ ℝ (#⋅&)×(#⋅&) . Note that as we are only interested in differences between different " , we are for example free to introduce a sum-tozero constraint on these parameters to reduce the overall variance, and we do this throughout.
Below we present the specific statistical models used for each of the different experimental datasets for which (versions of) maximum likelihood estimation then delivers these quantities. All computations were performed in R [1].

Fertility
Let !", ∈ ℤ be the th brood size measurement corresponding to female flies with gene type in batch for = 1, … , ", (where ", may be 0 for some ( , )). We will first present our analysis of the data on females; analysis for the data on males proceeded similarly.
To examine the mean-variance relationship (i.e. how ( !", ) relates to Var( !", )), we first formed for all ( , ) such that ", ≥ 2, We then regressed ", on to ", via the following optimisation: The lack of intercept in this regression encodes the restriction that when ( !", ) = 0 we must have Var( !", ) = 0; the use of the absolute value rather than the more usual squared error loss is to account for the exponential-type tails we may expect for the ", ; and the weights ", reflect the variance of the ", .
We thus obtained an estimated variance function 9 ( ) = I % + I . . such that 9 ( !", ) ≈ Var( !", ). We obtained coefficients and as I . was negative, we were able to express 9 as a scaled version of a Bernoulli variance function via To fit a regression model with this form of variance function, we used a quasibinomial regression after transforming the data !", ↦ Z !", = | I . | !", / I % . The transformed data took values in [0,1] so we used a logit link and modelled the mean Z !", as log a Z !", To handle zero counts we used the bias correction of [2], as implemented in [3] , which always produces finite parameter estimates. The analysis of the male data was very similar, and in the end we obtained estimates ( 9 " ) "(% & and block diagonal estimated variance matrix 9 (as the male and female data were independent).

Wing size
Let !", ∈ ℝ . be the th measurement on the th gene in the th batch, defined for = 1, … , ", (where ", may be 0 for some ( , )) with first and second components denoting measurements for anterior and posterior wing segments respectively. We used the model (0, " ) with Σ " ∈ ℝ .×. . Inspection of the data showed that the correlation matrices corresponding to the Σ " vary very little over and the difference is barely detectable by permutation tests. We therefore constrained Σ " in the following way: Σ " = " %/. Σ univ " %/. where Σ univ ∈ ℝ .×. is a universal correlation matrix and the " ∈ ℝ .×. are diagonal matrices with variances corresponding to each gene.

PolyQ aggregates
We first constructed from the available data two quantities from each replicate: the number of aggregates with area in pixels greater than 50, and the corresponding number with area less than or equal to 50. They form the components of !", ∈ ℤ . , for which we use quasi-Poisson models with log | links as follows.
We performed two separate Poisson regressions for each component of the response. In order to avoid issues where parameter estimates from standard maximum likelihood estimation were too large, we employed the bias correction of [2], as implemented in [3]. To estimate the covariance matrix of the parameters, we noted that the working residuals from the regressions displayed a covariance that was constant across fitted values from each of the regressions. Using this estimated covariance and estimated dispersion parameters we formed a full covariance matrix Σ 9 ∈ ℝ (.⋅&)×(.⋅&) for all ( 9 " ) "(% & .

Survival under stress
Let !",5 and !",5 6 denote the censored survival and censoring times under oxidative stress for the th replicate of gene in batch and wheel . We fitted a Cox proportional hazards model of the form where ℎ !",5 is the hazard function of the unobserved uncensored version !",5 * of !",5 , and ℎ , is an unspecified baseline hazard function for batch .
We used an analogous model for the data concerning survival times under starvation.

Dataset
Transform function Fertility logistic Wing size identity PolyQ aggregates exponential Survival under Stress identity Climbing speed identity Let us write " = ( " ) for each = 1, … , . We considered as fixed, and as is common in the analysis of outliers, considered a model for the " as samples from a mixture of a normal distribution and an outlier distribution out [4]: We assumed the mixture proportion to be greater than 0.5 and that the support of out was sufficiently far from . We used the minimum covariance determinant estimator [5], as implemented in [6], to give a robust estimate ̂ of and an initial estimate Σ 9 9 of Σ 9 . Whilst we can expect that ̂ is a reasonable estimate of , Σ w 9 will be substantially inflated by the sampling variability of the (̂") "(% & .
To correct for this, we employed the following bootstrap strategy. 3. Compute robust covariance estimates Σ w 9 (%) , … , Σ w 9 (;) based on each of the bootstrap samples (̂" (%) ) "(% & , … , (̂" (;) ) "(% of the procedure above, which mimics the sampling distribution of the ( 9 " ) "(% & , would typically be more reliable than the analogous approximation for the (̂") "(% & . Given our final estimates ̂ and Σ 9 9 , we set the outlier region to be the complement of the elliptical contour of a # (, Σ 9 9 ) density such that the probability of ∼ # (, Σ 9 9 ) falling within the region is given by 0.05 or 0.1, depending on the dataset. This outlier region can be mapped to the -space using the inverse of ; in the sequel we will refer to this region as .

Testing for outliers
Given outlier region such that all for which " ∈ are deemed outliers, we constructed for each an (approximate) -value " for the null hypothesis " ∉ . In the cases where the region was an interval, this was straightforward. In the cases where the region was two-dimensional, this was done using a bootstrap scheme, the main steps of which were as follows. Denote by the complement of , and also let 9 be the (elliptical) region ( ).
1. Compute via the delta method an estimate Ω • " of the variance of ".
(Details for how this is performed are given in Section 4.)